This project dives into the diverse world of Airbnb rental prices, attempting to predict them based on available data. With over 40,000 Airbnb listings across European cities, each listing is a mix of different features like number of bedrooms, location, and reviews, making pricing quite complex. The main question for this analysis: Can we predict Airbnb prices based on other variables?
Figure: Cost of accommodation per night in euros (€), Predicted Sum vs. Real Sum
I wanted to build a linear regression model to predict accommodation
prices. Because there were around 50,000 data points, I knew that I had
to use a programming language. Python was the natural choice, as I
already had extensive experience with it and it is more than capable of
handling a data set of this size. The other sensible option that I could
have used was R.
I used the pandas and scikit-learn (sklearn) libraries to build the
model. Once the model was built, I used Python to calculate the
predicted price for each data point. I then exported the original data
along with the predictions and created the above plot in Tableau. The
R-Squared value is 0.949.
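The workflow described here can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the data is synthetic, and the column names (`bedrooms`, `distance_to_centre_km`, `guest_rating`, `real_sum`, `predicted_sum`) are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for the listings data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
listings = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=n),
    "distance_to_centre_km": rng.uniform(0.5, 10.0, size=n),
    "guest_rating": rng.uniform(60, 100, size=n),
})
listings["real_sum"] = (
    60 * listings["bedrooms"]
    - 8 * listings["distance_to_centre_km"]
    + 1.5 * listings["guest_rating"]
    + rng.normal(0, 20, size=n)
)

# Fit an ordinary least squares model on the chosen features.
feature_columns = ["bedrooms", "distance_to_centre_km", "guest_rating"]
model = LinearRegression()
model.fit(listings[feature_columns], listings["real_sum"])

# Store one predicted price per listing next to the real price.
listings["predicted_sum"] = model.predict(listings[feature_columns])
r_squared = r2_score(listings["real_sum"], listings["predicted_sum"])

# Serialise to CSV text; passing a file path instead would write a file
# that a tool such as Tableau can open.
csv_text = listings.to_csv(index=False)
```

Keeping the real and predicted prices in the same table means the exported file has everything needed to plot one against the other.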
Points above the reference line indicate overestimation by the model (the predicted price is higher than the actual price), while points below represent underestimation (the predicted price is lower than the actual price).
If the model fit the data well, you would expect the residuals to be
scattered symmetrically about zero and the plot to show no clear trend.
As you can see, the above plot has neither property. There are clear
overestimations for properties that cost over 600 euros, and the bias
toward overestimation grows as the Real Sum increases, even though the
relationship on this graph still appears somewhat linear. Despite an
R-Squared value of 0.949, the pattern suggests that the model is not
fully capturing the underlying relationship between the variables.
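One way to check this kind of bias numerically, instead of only reading it off the plot, is to compare the average prediction error below and above a price threshold. The sketch below uses made-up (real, predicted) pairs, not the project's data, and the helper name `mean_error_by_band` is hypothetical. The error is defined here as predicted minus real, so a positive value means overestimation; the opposite sign convention is also common.

```python
from statistics import mean

# Made-up (real_sum, predicted_sum) pairs in euros, for illustration only.
pairs = [
    (80, 85), (120, 110), (200, 205), (350, 340),
    (500, 495), (650, 700), (800, 880), (1000, 1120),
]

def mean_error_by_band(pairs, threshold):
    """Return the mean (predicted - real) error for listings priced at or
    below the threshold and for listings priced above it."""
    low = [pred - real for real, pred in pairs if real <= threshold]
    high = [pred - real for real, pred in pairs if real > threshold]
    return mean(low), mean(high)

# Split at 600 euros, the point where the plot starts to show overestimation.
low_bias, high_bias = mean_error_by_band(pairs, threshold=600)
```

A mean error near zero in the lower band together with a clearly positive mean error in the upper band would match the pattern described above. The helper assumes both bands contain at least one listing.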
The next step is to experiment with polynomial regression, as the
evidence suggests that the model is underfitting.
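A possible starting point for that experiment, assuming scikit-learn is used again, is to wrap `PolynomialFeatures` and `LinearRegression` in a pipeline and compare the fit with a plain linear model. The data below is synthetic with a deliberately curved relationship, and the choice of degree 2 is an assumption for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship, for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(300, 1))
y = 50 + 5 * x[:, 0] + 4 * x[:, 0] ** 2 + rng.normal(0, 10, size=300)

# Baseline: a straight-line fit.
linear_model = LinearRegression().fit(x, y)

# Polynomial regression: expand x into [x, x^2], then fit a linear model.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x, y)

# R-squared on the training data for each model.
linear_r2 = linear_model.score(x, y)
poly_r2 = poly_model.score(x, y)
```

On training data the polynomial model can never score lower than the linear one, because it contains the linear model as a special case. A fair comparison would therefore score both models on held-out data.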